Sales forecasting plays a critical role in company growth because it informs how companies manage their workforce, inventory, and cash flow. Accurate sales forecasting enables companies to build appropriate promotional strategies and advertising campaigns and to reduce labor costs and financial expenditure. On the other hand, ineffective sales forecasting may lead to an undersupply or oversupply of inventory, with negative consequences in the long term.
The purpose of this project is to construct an accurate predictive model to forecast daily sales of Walmart retail goods for the next 28 days, from 2016-05-23 to 2016-06-20. There are 30,490 unique items in this data set; however, due to time constraints, I will only predict the item with the highest aggregate sales over the past 1,941 days, from 2011-01-29 to 2016-05-22. The other goal of the project is to identify patterns and factors that influence sales by creating data visualizations.
The data source of the project is from a data science competition, “M5 Forecasting” on Kaggle. There are three data sets, namely sales_train_evaluation.csv, sell_prices.csv, and calendar.csv.
sales_train_evaluation.csv: Includes daily sales from 2011-01-29 to 2016-05-22 (1,941 days in total).
calendar.csv: Contains information about the dates on which the products are sold.
sell_prices.csv: Contains information about the price of the products sold per store and date.
This is a hierarchical data set:
There are ten stores in total, 3 of them in the State of CA, 3 of them in the State of TX, and 3 of them in the State of WI.
There are three categories (hobbies, foods, and household) in these ten stores; each store carries 3,049 items.
There are ten departments in total, 2 of them belonging to the hobbies category, 3 of them belonging to the foods category, and 2 of them belonging to the household category.
Thus, there are 59,181,090 (3,049 items * 10 stores * 1,941 days) rows in total in the merged data set.
Additionally, the competition asks participants to take the daily sales from 2011-01-29 to 2016-04-24 (1,913 days in total) as the training data set and the daily sales from 2016-04-25 to 2016-05-22 (28 days in total) as the testing data set.
After splitting, the training data set has 58,327,370 rows and 19 columns, and the testing data set has 853,720 rows and 19 columns. Both contain information about the daily sales of 3,049 items at ten stores in the states of CA, TX, and WI, along with the daily sell price, category, and events.
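As a quick sanity check of these row counts (the analysis itself is in R; this is a small Python sketch of the arithmetic):

```python
# Sanity-check the merged data set dimensions: 3,049 items * 10 stores,
# with 1,941 days in total split into 1,913 training and 28 testing days.
items, stores = 3049, 10
total_days, train_days, test_days = 1941, 1913, 28

total_rows = items * stores * total_days
train_rows = items * stores * train_days
test_rows = items * stores * test_days

print(total_rows, train_rows, test_rows)  # 59181090 58327370 853720
assert train_rows + test_rows == total_rows
```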
The sell_price is NA when the item was not offered for sale. The table below shows the first five rows of the training data set.
## date wm_yr_wk weekday wday month year d event_name_1 event_type_1
## 1: 2011-01-29 11101 Saturday 1 1 2011 d_1
## 2: 2011-01-29 11101 Saturday 1 1 2011 d_1
## 3: 2011-01-29 11101 Saturday 1 1 2011 d_1
## 4: 2011-01-29 11101 Saturday 1 1 2011 d_1
## 5: 2011-01-29 11101 Saturday 1 1 2011 d_1
## event_name_2 event_type_2 id item_id
## 1: HOBBIES_1_001_CA_1_evaluation HOBBIES_1_001
## 2: HOBBIES_1_002_CA_1_evaluation HOBBIES_1_002
## 3: HOBBIES_1_003_CA_1_evaluation HOBBIES_1_003
## 4: HOBBIES_1_004_CA_1_evaluation HOBBIES_1_004
## 5: HOBBIES_1_005_CA_1_evaluation HOBBIES_1_005
## dept_id cat_id store_id state_id sales sell_price
## 1: HOBBIES_1 HOBBIES CA_1 CA 0 NA
## 2: HOBBIES_1 HOBBIES CA_1 CA 0 NA
## 3: HOBBIES_1 HOBBIES CA_1 CA 0 NA
## 4: HOBBIES_1 HOBBIES CA_1 CA 0 NA
## 5: HOBBIES_1 HOBBIES CA_1 CA 0 NA
The table below indicates the first five rows of the testing data set.
## date wm_yr_wk weekday wday month year d event_name_1 event_type_1
## 1: 2016-04-25 11613 Monday 3 4 2016 d_1914
## 2: 2016-04-25 11613 Monday 3 4 2016 d_1914
## 3: 2016-04-25 11613 Monday 3 4 2016 d_1914
## 4: 2016-04-25 11613 Monday 3 4 2016 d_1914
## 5: 2016-04-25 11613 Monday 3 4 2016 d_1914
## event_name_2 event_type_2 id item_id
## 1: HOBBIES_1_001_CA_1_evaluation HOBBIES_1_001
## 2: HOBBIES_1_002_CA_1_evaluation HOBBIES_1_002
## 3: HOBBIES_1_003_CA_1_evaluation HOBBIES_1_003
## 4: HOBBIES_1_004_CA_1_evaluation HOBBIES_1_004
## 5: HOBBIES_1_005_CA_1_evaluation HOBBIES_1_005
## dept_id cat_id store_id state_id sales sell_price
## 1: HOBBIES_1 HOBBIES CA_1 CA 0 8.38
## 2: HOBBIES_1 HOBBIES CA_1 CA 0 3.97
## 3: HOBBIES_1 HOBBIES CA_1 CA 0 2.97
## 4: HOBBIES_1 HOBBIES CA_1 CA 0 4.64
## 5: HOBBIES_1 HOBBIES CA_1 CA 1 2.88
The plots indicate that the total sales in CA are higher than in TX or in WI.
However, they all share the same seasonal patterns over the period. The sales at the beginning and the end of a year are lower than at any other time.
The bar plot indicates that the total sales of the foods category are much higher than the sales of the hobbies and household categories.
Additionally, both the foods and hobbies categories show an upward trend over the years. They also share a similar seasonal pattern: sales in winter are much lower than in other seasons.
The FOODS_3 department has the highest sales over the period, accounting for almost 50% of total sales.
The HOBBIES_2 department has the lowest sales, accounting for only 1% of total sales.
The departments FOODS_2 and HOUSEHOLD_1 show an upward trend over time. If the increase in sales is driven by demand rather than supply, I would recommend that the company conduct further analysis to determine whether these two departments have higher profit potential.
Since the original training data set spans from 2011-01-29 to 2016-04-24, I only use the data from 2012-01-01 to 2015-12-31 so that complete calendar years are compared.
We can see that sales in winter are lower than in other seasons, consistent with the earlier plot “Total sales by state.”
As I mentioned before, to accurately compare the total sales data, I only use the data from 2011-01-29 to 2016-01-29.
Overall, there is an upward trend over the year.
The line plots show different patterns for these three random samples. For instance, the historical sales of ‘FOODS_1_137_WI_2_evaluation’ only start from 2011-02-01, while the other samples show totally different patterns.
Since it is hard to build one accurate model for all items when their sales patterns differ so much, I decide to choose the item with the highest sales volume and predict its sales for the following 28 days.
The table below shows the top three best-selling items from 2011-01-29 to 2016-04-24.
Since FOODS_3_090_CA_3_evaluation has the highest sales (250,502 units in total), I will predict its sales for the next 28 days starting from 2016-05-23.
## # A tibble: 3 x 2
## id total_sales
## <chr> <int>
## 1 FOODS_3_090_CA_3_evaluation 250502
## 2 FOODS_3_586_TX_2_evaluation 192835
## 3 FOODS_3_586_TX_3_evaluation 150122
The goal is to forecast daily sales rather than a monthly or yearly trend, so including too much data may decrease accuracy.
The line plots below indicate that the sales patterns differ from year to year. Although the patterns in 2014 and 2015 are similar, I would not infer that the sales pattern will be the same in 2016. Additionally, our primary goal is to predict sales demand rather than sales supply, and there is no additional information proving that sales will follow the same pattern as in the same period of previous years. The line plot “Total sales in 2016” indicates no sales before 2016-01-07, so I only use the data after 2016-01-06.
The heatmap below shows a weekly pattern from 2011-01-01 to 2016-04-24. I split the data into a 75:25 ratio (84 days : 28 days), so only the 84 days (7 days * 12 weeks) before 2016-04-25 are taken as the training data set.
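The split boundaries can be derived directly from the competition cut-off date (a small Python sketch; the dates are those stated above):

```python
from datetime import date, timedelta

# Derive the 75:25 split (84 training days : 28 testing days) around the
# competition cut-off: day d_1914 (2016-04-25) starts the test window.
test_start = date(2016, 4, 25)
train_start = test_start - timedelta(days=84)   # 7 days * 12 weeks
train_end = test_start - timedelta(days=1)
test_end = test_start + timedelta(days=27)      # 28 test days inclusive
print(train_start, train_end)  # 2016-02-01 2016-04-24
print(test_start, test_end)    # 2016-04-25 2016-05-22
```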
A key assumption of the ARIMA model is that the data are stationary, meaning the properties of the series do not depend on the time at which it is observed. Thus, I run the KPSS test to check the stationarity of the time series.
Since the p-value (0.1) is greater than 0.05, we do not reject the null hypothesis that the data are stationary. We can conclude that the series is stationary and that differencing is not strictly required.
## # A tibble: 1 x 2
## kpss_stat kpss_pvalue
## <dbl> <dbl>
## 1 0.103 0.1
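For reference, the KPSS statistic compares the cumulative sums of the demeaned series against a long-run variance estimate. Below is a minimal pure-Python sketch (illustrative only; the report uses an R implementation, and the p-value is read from tabulated critical values, roughly 0.463 at the 5% level for level stationarity):

```python
import math

def kpss_stat(y, lags=None):
    """KPSS statistic for level stationarity (null hypothesis: stationary).

    Illustrative pure-Python version. A statistic below the tabulated 5%
    critical value (about 0.463) fails to reject stationarity.
    """
    n = len(y)
    mean = sum(y) / n
    e = [v - mean for v in y]              # residuals from the mean
    # partial sums S_t of the residuals
    s, partial = 0.0, []
    for v in e:
        s += v
        partial.append(s)
    if lags is None:
        lags = int(4 * (n / 100) ** 0.25)  # a common default truncation
    # long-run variance estimate with Bartlett-kernel weights
    lrv = sum(v * v for v in e) / n
    for k in range(1, lags + 1):
        w = 1 - k / (lags + 1)
        gamma_k = sum(e[t] * e[t - k] for t in range(k, n)) / n
        lrv += 2 * w * gamma_k
    return sum(v * v for v in partial) / (n * n * lrv)

# A bounded oscillating series looks stationary; a trending series does not.
print(kpss_stat([math.sin(t) for t in range(200)]) < 0.463)  # True
print(kpss_stat(list(range(200))) > 0.463)                   # True
```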
Although the data pass the KPSS test, the ACF plot shows a positive spike at every 7th lag, indicating a high correlation between the item's demand on a given day and its demand seven days earlier. Thus, to remove this consistent seasonal pattern, I decide to take a seasonal difference first.
The seasonally differenced data are shown below. However, the series still has positive autocorrelation out to a high number of lags, so I plan to take a further first difference.
The plots below indicate that the data are stationary after the seasonal difference and the additional first difference.
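The two differencing steps are simple lag operations; here is a small Python sketch on a toy series with a period-7 pattern plus a linear trend (the toy numbers are made up for illustration):

```python
def difference(y, lag=1):
    """Difference a series at the given lag: y'[t] = y[t] - y[t - lag]."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

# Toy stand-in for daily sales: a weekly (period-7) pattern plus a trend.
weekly = [100, 95, 98, 102, 140, 165, 142]
y = [weekly[t % 7] + 0.5 * t for t in range(56)]   # 8 weeks of data

seasonal = difference(y, lag=7)    # removes the weekly pattern
stationary = difference(seasonal)  # then one first difference
print(set(seasonal), set(stationary))  # {3.5} {0.0}
```

On this idealized series the seasonal difference leaves only the constant trend increment, and the further first difference removes that as well; on real data the same two steps leave irregular but stationary fluctuations.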
A seasonal ARIMA model will be built based on the patterns shown in the ACF and the PACF. A seasonal ARIMA model is composed of a non-seasonal part (p,d,q) and a seasonal part (P,D,Q), written as ARIMA(p,d,q)(P,D,Q)[m], where m is the seasonal period. In this case, ARIMA(0,1,1)(0,1,1)[7] is chosen as my seasonal ARIMA model. The reasons are below:
The PACF tails off, with values decaying toward zero over many lags.
A sharp cutoff and the negative autocorrelation at lag 1 in the ACF indicate that the series may be slightly over-differenced.
The first two reasons suggest a non-seasonal MA(1) term.
The series was differenced once, so d = 1. Hence, the non-seasonal part (p,d,q) will be (0,1,1).
The significant spike at lag 7 in the ACF suggests a seasonal MA(1) component.
We usually avoid mixing SAR(P) and SMA(Q) terms in the same model and avoid using more than one of either kind.
The series was seasonally differenced once, so D = 1. Hence, the seasonal part (P,D,Q) will be (0,1,1).
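For reference, the chosen model can be written in standard backshift notation, where B is the backshift operator, theta_1 the non-seasonal MA coefficient, Theta_1 the seasonal MA coefficient, and epsilon_t white noise (this is the generic seasonal ARIMA form, not anything specific to this data set):

```latex
(1 - B)(1 - B^{7})\, y_t = (1 + \theta_1 B)(1 + \Theta_1 B^{7})\, \varepsilon_t
```

The fitted coefficients reported below correspond to theta_1 = -1.0000 and Theta_1 = -0.9999.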
## Series: sales
## Model: ARIMA(0,1,1)(0,1,1)[7]
##
## Coefficients:
## ma1 sma1
## -1.0000 -0.9999
## s.e. 0.0577 0.1606
##
## sigma^2 estimated as 582.1: log likelihood=-360.9
## AIC=727.81 AICc=728.14 BIC=734.8
## Series: sales
## Model: ARIMA(2,0,1)(0,1,2)[7]
##
## Coefficients:
## ar1 ar2 ma1 sma1 sma2
## 0.8891 -0.2196 -0.8132 -1.0356 0.1219
## s.e. 0.2741 0.1274 0.2961 0.2150 0.1683
##
## sigma^2 estimated as 592.2: log likelihood=-359.31
## AIC=730.62 AICc=731.82 BIC=744.68
## Series: sales
## Model: ARIMA(3,1,0)(2,1,1)[7]
##
## Coefficients:
## ar1 ar2 ar3 sar1 sar2 sma1
## -0.5487 -0.4208 -0.2692 -0.1149 -0.1363 -0.8893
## s.e. 0.1168 0.1197 0.1070 0.1819 0.1569 0.3183
##
## sigma^2 estimated as 826.6: log likelihood=-367.02
## AIC=748.05 AICc=749.69 BIC=764.36
## Series: sales
## Model: ETS(M,N,A)
## Smoothing parameters:
## alpha = 0.0001001174
## gamma = 0.0001004647
##
## Initial states:
## l s1 s2 s3 s4 s5 s6 s7
## 110.3713 23.10466 50.72791 20.34791 -16.44099 -24.48968 -27.08718 -26.16264
##
## sigma^2: 0.0494
##
## AIC AICc BIC
## 914.8099 917.8236 939.1181
The table below indicates that the manually specified “ARIMA(0,1,1)(0,1,1)[7]” is the best model because it has the lowest AICc among all models. (Strictly speaking, AICc values are only comparable between ARIMA models with the same orders of differencing, and ARIMA and ETS AICc values are not directly comparable.) In the next step, I will perform residual diagnostics for all models to check whether the assumptions are met.
## .model model AIC AICc BIC
## 1 my_sarima ARIMA(0,1,1)(0,1,1)[7] 727.8078 728.1411 734.8000
## 2 auto_sarima ARIMA(2,0,1)(0,1,2)[7] 730.6170 731.8170 744.6798
## 3 auto_sarima_d ARIMA(3,1,0)(2,1,1)[7] 748.0474 749.6944 764.3625
## 4 auto_ets ETS(M,N,A) 914.8099 917.8236 939.1181
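The criteria in the table follow the usual definitions. The Python sketch below recomputes them for the ARIMA(0,1,1)(0,1,1)[7] fit from its reported log-likelihood; the parameter count k = 3 (ma1, sma1, and sigma^2) and effective sample size n = 77 are assumptions inferred from the reported values, not taken from the original code:

```python
import math

def info_criteria(loglik, k, n):
    """AIC, AICc and BIC from a model's log-likelihood.

    k counts all estimated parameters; n is the number of observations
    the likelihood is computed on (assumed here to be 77 for the
    seasonally differenced 84-day training series).
    """
    aic = -2 * loglik + 2 * k
    aicc = aic + 2 * k * (k + 1) / (n - k - 1)
    bic = -2 * loglik + k * math.log(n)
    return aic, aicc, bic

aic, aicc, bic = info_criteria(loglik=-360.9, k=3, n=77)
# Close to the reported 727.81 / 728.14 / 734.8; the small gaps come from
# the rounded log-likelihood.
print(round(aic, 2), round(aicc, 2), round(bic, 2))  # 727.8 728.13 734.83
```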
We can check for autocorrelation of the residuals using the ACF plot of the model residuals. If more than 95% of the lags fall between the two blue dotted lines, we can conclude that the residuals are not autocorrelated.
The ACF plot of the residuals from the ARIMA(0,1,1)(0,1,1)[7] model below shows that the residuals resemble white noise: all lags fall inside the 95% confidence bounds, indicating that the residuals appear to be random.
The ACF plot of the residuals from the ARIMA(2,0,1)(0,1,2)[7] model below shows that the residuals resemble white noise: all lags fall inside the 95% confidence bounds, indicating that the residuals appear to be random.
The ACF plot of the residuals from the ARIMA(3,1,0)(2,1,1)[7] model below shows that the residuals still resemble white noise even though one lag falls outside the 95% confidence bounds: since 1/30 ≈ 0.03 is less than the 0.05 we would expect by chance, we can still consider the residuals not autocorrelated.
The ACF plot of the residuals from the ETS(M,N,A) model below shows that the residuals resemble white noise: all lags fall inside the 95% confidence bounds, indicating that the residuals appear to be random.
Although the residual plots of all models look good, I still check the autocorrelation with the Ljung-Box test. The table reveals that the residuals from all models behave like white noise, since their p-values are all greater than 0.05. Thus, we do not reject the null hypothesis and conclude that the residuals are not autocorrelated.
## model .model lb_stat lb_pvalue
## 1 ARIMA(0,1,1)(0,1,1)[7] my_sarima 29.92788 0.3666739
## 2 ARIMA(2,0,1)(0,1,2)[7] auto_sarima 22.09206 0.6304225
## 3 ARIMA(3,1,0)(2,1,1)[7] auto_sarima_d 31.40154 0.1425391
## 4 ETS(M,N,A) auto_ets 32.05667 0.3648964
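For reference, the Ljung-Box statistic in the table is Q = n(n+2) * sum over k of r_k^2 / (n - k), where r_k is the lag-k residual autocorrelation. A minimal pure-Python sketch (the report itself computes this in R):

```python
def ljung_box(residuals, max_lag):
    """Ljung-Box Q statistic: Q = n(n+2) * sum_{k=1..h} r_k^2 / (n - k).

    Under the null of no autocorrelation, Q follows a chi-squared
    distribution, so a small Q (large p-value) means white-noise-like
    residuals, matching the table above.
    """
    n = len(residuals)
    mean = sum(residuals) / n
    e = [v - mean for v in residuals]
    denom = sum(v * v for v in e)
    q = 0.0
    for k in range(1, max_lag + 1):
        r_k = sum(e[t] * e[t - k] for t in range(k, n)) / denom
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q

# Strongly autocorrelated residuals give a large Q (reject white noise);
# the 5% critical value of chi-squared with 10 df is about 18.31.
print(ljung_box([(-1) ** t for t in range(100)], 10) > 18.31)  # True
```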
The manually specified “ARIMA(0,1,1)(0,1,1)[7]” is the best model, for the following reasons:
RMSE is the evaluation metric, since we want the model with the most accurate predictions: the differences between the predicted and actual values should be small and unbiased. The my_sarima model ARIMA(0,1,1)(0,1,1)[7] has the lowest RMSE (27.20693) on the testing data set.
In addition, the RMSE values for the training and testing data sets should be similar if the model generalizes well. The column “difference” shows the difference between the RMSE on the testing data set and the RMSE on the training data set, and the my_sarima model shows the smallest difference.
As previously stated, the **my_sarima** model also has the smallest AICc value among the fitted models.
## # A tibble: 4 x 6
## .model .type.x RMSE.x .type.y RMSE.y difference
## <chr> <chr> <dbl> <chr> <dbl> <dbl>
## 1 auto_sarima Training 22.5 Test 28.4 5.91
## 2 auto_sarima_d Training 26.2 Test 30.9 4.70
## 3 my_sarima Training 22.6 Test 27.2 4.56
## 4 auto_ets Training 22.8 Test 27.8 5.07
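The model-ranking logic can be sketched as follows (Python, reusing the rounded RMSE values from the table; the rmse helper shows how the metric itself is computed):

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# (train RMSE, test RMSE) per model, taken from the table above;
# the series are not re-fit here.
results = {
    "auto_sarima":   (22.5, 28.4),
    "auto_sarima_d": (26.2, 30.9),
    "my_sarima":     (22.6, 27.2),
    "auto_ets":      (22.8, 27.8),
}
best = min(results, key=lambda m: results[m][1])  # lowest test RMSE
smallest_gap = min(results, key=lambda m: results[m][1] - results[m][0])
print(best, smallest_gap)  # my_sarima my_sarima
```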
The plot below compares the predicted sales from the different models with the actual sales, shown in black. The predicted values from the four models look similar; even though they differ from the actual values, their trends closely track the trend of the actual values.
In this section, I will use the model ARIMA(0,1,1)(0,1,1)[7] to forecast the daily sales of Walmart retail goods for the following 28 days from 2016-05-23 to 2016-06-20 using the data from 2016-02-01 to 2016-05-22.
Forecasts from ARIMA(0,1,1)(0,1,1)[7] for the next 28 days are shown below. We can see that the model captures the weekly seasonality of sales.
The predicted daily sales for 28 days from 2016-05-23 to 2016-06-20 are shown below.
## .model date sales .mean
## 1 Best_SARIMA 2016-05-23 N(95.93752, 668.5512) 95.93752
## 2 Best_SARIMA 2016-05-24 N(94.12502, 668.5512) 94.12502
## 3 Best_SARIMA 2016-05-25 N(96.18752, 668.5512) 96.18752
## 4 Best_SARIMA 2016-05-26 N(98.93752, 668.5512) 98.93752
## 5 Best_SARIMA 2016-05-27 N(142.1875, 668.5512) 142.18752
## 6 Best_SARIMA 2016-05-28 N(166.6875, 668.5512) 166.68752
## 7 Best_SARIMA 2016-05-29 N(142.4375, 668.5512) 142.43751
## 8 Best_SARIMA 2016-05-30 N(96.7537, 673.1779) 96.75370
## 9 Best_SARIMA 2016-05-31 N(94.9412, 673.1779) 94.94120
## 10 Best_SARIMA 2016-06-01 N(97.0037, 673.1779) 97.00370
## 11 Best_SARIMA 2016-06-02 N(99.7537, 673.1779) 99.75370
## 12 Best_SARIMA 2016-06-03 N(143.0037, 673.1779) 143.00370
## 13 Best_SARIMA 2016-06-04 N(167.5037, 673.1779) 167.50370
## 14 Best_SARIMA 2016-06-05 N(143.2537, 673.1779) 143.25369
## 15 Best_SARIMA 2016-06-06 N(97.56988, 678.3186) 97.56988
## 16 Best_SARIMA 2016-06-07 N(95.75738, 678.3186) 95.75738
## 17 Best_SARIMA 2016-06-08 N(97.81988, 678.3186) 97.81988
## 18 Best_SARIMA 2016-06-09 N(100.5699, 678.3186) 100.56988
## 19 Best_SARIMA 2016-06-10 N(143.8199, 678.3186) 143.81988
## 20 Best_SARIMA 2016-06-11 N(168.3199, 678.3186) 168.31988
## 21 Best_SARIMA 2016-06-12 N(144.0699, 678.3186) 144.06987
## 22 Best_SARIMA 2016-06-13 N(98.38606, 683.9734) 98.38606
## 23 Best_SARIMA 2016-06-14 N(96.57356, 683.9734) 96.57356
## 24 Best_SARIMA 2016-06-15 N(98.63606, 683.9734) 98.63606
## 25 Best_SARIMA 2016-06-16 N(101.3861, 683.9734) 101.38606
## 26 Best_SARIMA 2016-06-17 N(144.6361, 683.9734) 144.63606
## 27 Best_SARIMA 2016-06-18 N(169.1361, 683.9734) 169.13606
## 28 Best_SARIMA 2016-06-19 N(144.886, 683.9734) 144.88605
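Each forecast above is reported as a normal distribution N(mean, variance), so prediction intervals follow directly; for example, a 95% interval for the first forecast day (a quick Python sketch):

```python
import math

# Forecast distribution for 2016-05-23, taken from the first row of the
# table above; a 95% prediction interval is mean +/- 1.96 * sqrt(variance).
mean, variance = 95.93752, 668.5512
sd = math.sqrt(variance)
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd
print(round(lower, 1), round(upper, 1))  # roughly 45.3 146.6
```

The intervals are wide relative to the point forecasts, which is worth keeping in mind when using these predictions for inventory decisions.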
To briefly summarize the project:
I analyze Walmart’s sales patterns, sales growth rate, and promotion strategies through data visualizations.
Due to resource limits, I only predict the best-selling item ‘FOODS_3_090_CA_3_evaluation’ for the next 28 days from 2016-05-23 to 2016-06-20.
There is a 7-day pattern in the data, so I choose seasonal ARIMA and ETS models as candidate models.
Among the four models, I select the one with the lowest RMSE on the testing data set as the most accurate.
I use the best model to forecast the 28-day sales from 2016-05-23 to 2016-06-20, and the predicted values display the weekly seasonal pattern we expected.
Nevertheless, the model ignores the “event” predictor and the unusually low sales values, since no evidence indicates that sales will be lower or higher on specific dates; this may lead to inaccurate predictions on dates when the item is not for sale. Thus, more information about the item is needed. With that additional information, we could build a more accurate model, for example with a Prophet model that accepts customized recurring events as inputs.
In addition, if a transaction-level data set were available, we could analyze consumer buying patterns with market basket analysis to uncover associations between items and help the company craft a more precise marketing strategy. For example, Walmart reportedly discovered that sales of diapers and beer were correlated on Friday nights; after moving these two items closer together, the sales of both increased significantly.
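The metrics behind market basket analysis (support, confidence, and lift) can be sketched in a few lines of Python, with entirely made-up transactions for illustration:

```python
# Hypothetical transactions; the diapers/beer numbers are invented
# purely to illustrate the metrics, not taken from Walmart data.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "bread"},
    {"beer", "chips"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) estimated from the transactions."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    """Lift > 1 suggests a positive association between the itemsets."""
    return confidence(antecedent, consequent) / support(consequent)

print(lift({"diapers"}, {"beer"}) > 1)  # True: a positive association
```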
To sum up, this predictive model can help Walmart better control the inventory of goods and reduce financial expenditure, especially for high-cost products.